fix(infra): Resolve Race Condition in Parallel Base Image Builds #14189
The parallel execution of base image builds was causing a race condition, where dependent images (e.g., `base-builder`) were starting their build process before their base images (e.g., `base-clang`) had finished building in the same pipeline. This resulted in the builder pulling the previously existing, and incorrect, base image from the registry (e.g., an Ubuntu 20.04-based image instead of the new 24.04 version). This commit fixes the issue by introducing an `IMAGE_DEPENDENCIES` map that explicitly defines the build order. The `get_base_image_steps` function now uses this map to add `waitFor` clauses to the Google Cloud Build steps, ensuring that images are built sequentially according to their dependency graph.
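A minimal sketch of how such a map could drive `waitFor`. Only `IMAGE_DEPENDENCIES`, `get_base_image_steps`, and `waitFor` are named in the commit description; the step ids, the Dockerfile path layout, the three-entry map, and the helper name `sketch_base_image_steps` are assumptions made for illustration, and the real map presumably has more entries.

```python
# Hypothetical dependency map: each image lists the images it is built FROM.
IMAGE_DEPENDENCIES = {
    'base-image': [],
    'base-clang': ['base-image'],
    'base-builder': ['base-clang'],
}


def step_id(image, version):
    """Builds a step id (assumed naming scheme) for one image and version."""
    return f'build-{image}-{version}'


def sketch_base_image_steps(images, version):
    """Returns Cloud Build step dicts with waitFor derived from the map."""
    steps = []
    for image in images:
        tag = f'gcr.io/oss-fuzz-base/{image}:{version}'
        # Only wait for dependencies that are built in this same pipeline.
        deps = [d for d in IMAGE_DEPENDENCIES.get(image, []) if d in images]
        steps.append({
            'name': 'gcr.io/cloud-builders/docker',
            # The build context path is an assumption, not the real layout.
            'args': ['build', '-t', tag, f'infra/base-images/{image}'],
            'id': step_id(image, version),
            # In Cloud Build, '-' means "start immediately"; otherwise the
            # step starts once every listed step id has finished.
            'waitFor': [step_id(d, version) for d in deps] or ['-'],
        })
    return steps


steps = sketch_base_image_steps(
    ['base-image', 'base-clang', 'base-builder'], 'ubuntu-24-04')
for step in steps:
    print(step['id'], '<-', step['waitFor'])
```

Because the version is part of every step id in this sketch, the chain for one Ubuntu version never waits on the chain for another.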
/gcbrun trial_build.py zlib --fuzzing-engines libfuzzer --sanitizers address
/gcbrun trial_build.py zlib --fuzzing-engines libfuzzer --sanitizers address
/gcbrun skip
This reverts the change that added libssl-dev to the ubuntu-24-04 base-runner image. This dependency was found to be unnecessary as the build works without it, as seen in trial builds.
/gcbrun skip
/gcbrun skip
FWIW, with this PR merged, #14157 appears to have been fixed. The latest 24-04 base-builder image comes with Ubuntu 24.04.
I think this caused a regression for trial runs where we want
This comment is meant to trigger a trial run for all projects:
but the log shows:
Step #1 - "Legacy": INFO:root:================================================================
Step #1 - "Legacy": INFO:root: PHASE 2: STARTING TEST BUILDS
Step #1 - "Legacy": INFO:root:================================================================
Step #1 - "Legacy": INFO:root:Build type: fuzzing
Step #1 - "Legacy": INFO:root: - Selected projects: 309 / 1323 (due to failed production builds)
Step #1 - "Legacy": INFO:root: - To build all projects, use the --force-build flag.
Step #1 - "Legacy": INFO:root:Starting to create and trigger builds for build type: fuzzing
Step #3 - "Ubuntu 24.04": INFO:root:Triggered all builds.
Step #3 - "Ubuntu 24.04": INFO:root:================================================================
Step #3 - "Ubuntu 24.04": INFO:root: PHASE 2: SKIPPED BUILDS
Step #3 - "Ubuntu 24.04": INFO:root:================================================================
Step #3 - "Ubuntu 24.04": INFO:root:Total skipped builds: 1060
Step #3 - "Ubuntu 24.04": INFO:root:--- SKIPPED BUILDS ---
Step #3 - "Ubuntu 24.04": INFO:root: - fuzzing:
Step #3 - "Ubuntu 24.04": INFO:root: - abseil-py: Production build succeeded
Step #3 - "Ubuntu 24.04": INFO:root: - ada-url: Production build succeeded
i.e. all projects that currently build successfully are actually skipped. Note that we primarily want the successful projects to run, because we want to make sure no regressions happen in the various projects. I assume it's these lines here:
oss-fuzz/infra/build/functions/trial_build.py
Lines 197 to 198 in 4b541c7
`not` was added in this PR.
Summary
This PR fixes a critical race condition in the base image build process that caused the `gcr.io/oss-fuzz-base/base-builder:ubuntu-24-04` image to be incorrectly built with an Ubuntu 20.04 base.
The fix ensures build steps are executed in the correct order by explicitly defining a dependency graph, guaranteeing that versioned images are always built on top of their corresponding, freshly-built base layers.
The Problem
A report indicated that the `base-builder:ubuntu-24-04` image contained Ubuntu 20.04. An initial investigation confirmed this behavior.
Investigation Steps
1. Dockerfile Verification: The entire dependency chain of Dockerfiles was inspected:
   - `base-builder:ubuntu-24-04` correctly used `FROM base-clang:ubuntu-24-04`.
   - `base-clang:ubuntu-24-04` correctly used `FROM base-image:ubuntu-24-04`.
   - `base-image:ubuntu-24-04` correctly used `FROM ubuntu:24.04`.
   This ruled out any static configuration errors in the Dockerfiles themselves.
2. Build Process Analysis: A `dry-run` of the `infra/build/functions/base_images.py` script revealed that all build steps for the different base images were being generated to run in parallel in Google Cloud Build.
Root Cause: Race Condition
The parallel execution was the source of the problem. Because the builds for `base-image`, `base-clang`, and `base-builder` were triggered simultaneously, a race condition occurred:
1. The `base-builder:ubuntu-24-04` build would start.
2. It would pull its base image, `gcr.io/oss-fuzz-base/base-clang:ubuntu-24-04`.
3. The new build of `base-clang:ubuntu-24-04` had not yet finished, so the registry still served the previously published image.
The same issue was happening between `base-clang` and `base-image`.
The Solution
To resolve this, we now enforce a sequential build order that respects the image dependency hierarchy.
1. Dependency Map: An `IMAGE_DEPENDENCIES` dictionary was introduced in `infra/build/functions/base_images.py` to define the explicit build order (e.g., `base-clang` depends on `base-image`).
2. Sequential Build Steps: The `get_base_image_steps` function was updated to read this map and inject a `waitFor` clause into each Google Cloud Build step. This forces GCB to wait for a dependency to finish building before starting the next step in the chain.
Verification
A `dry-run` was executed after the fix, and the generated build steps now correctly reflect the sequential dependency order. A full build was also triggered, confirming that the fix works in a real environment and produces the correct image.
This change ensures the integrity and correctness of our base images without sacrificing the parallelism between different Ubuntu version builds (e.g., the `ubuntu-20-04` and `ubuntu-24-04` builds still run in parallel with each other).
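The dry-run check described under Verification could be automated along the following lines. This is a sketch under stated assumptions, not the check used in the PR: the step ids, the sample step lists, and the helper `find_ordering_problems` are hypothetical, and only the `id` and `waitFor` step fields are taken from Google Cloud Build.

```python
def find_ordering_problems(steps, expected_deps):
    """Returns human-readable problems found in a list of build steps.

    steps: Cloud Build style dicts with 'id' and optional 'waitFor'.
    expected_deps: maps a step id to the step ids it must wait for.
    """
    ids_seen = set()
    problems = []
    for step in steps:
        # '-' is Cloud Build's marker for "start immediately".
        waits = set(step.get('waitFor', [])) - {'-'}
        for dep in sorted(waits - ids_seen):
            problems.append(
                f"{step['id']}: waits for {dep}, which is not an earlier step")
        for dep in expected_deps.get(step['id'], []):
            if dep not in waits:
                problems.append(f"{step['id']}: missing waitFor on {dep}")
        ids_seen.add(step['id'])
    return problems


# Hypothetical dry-run output after the fix: one sequential chain.
fixed_steps = [
    {'id': 'base-image-ubuntu-24-04', 'waitFor': ['-']},
    {'id': 'base-clang-ubuntu-24-04',
     'waitFor': ['base-image-ubuntu-24-04']},
    {'id': 'base-builder-ubuntu-24-04',
     'waitFor': ['base-clang-ubuntu-24-04']},
]
# The same steps as they would look if every step started immediately.
racy_steps = [dict(step, waitFor=['-']) for step in fixed_steps]

expected = {
    'base-clang-ubuntu-24-04': ['base-image-ubuntu-24-04'],
    'base-builder-ubuntu-24-04': ['base-clang-ubuntu-24-04'],
}

print(find_ordering_problems(fixed_steps, expected))
for problem in find_ordering_problems(racy_steps, expected):
    print(problem)
```

The fixed chain yields no problems, while the fully parallel variant reports one missing `waitFor` for `base-clang` and one for `base-builder`.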